CANCORR
Overview
The CANCORR function performs Canonical Correlation Analysis (CCA), a multivariate statistical technique that identifies and measures the linear relationships between two sets of variables. First introduced by Harold Hotelling in 1936, CCA finds linear combinations of each variable set that maximize the correlation between them, producing canonical variates — pairs of composite variables with the highest possible correlation.
Given two sets of variables X = (x_1, \ldots, x_p) and Y = (y_1, \ldots, y_q), CCA seeks weight vectors a and b such that the correlation between U = a^T X and V = b^T Y is maximized. Each subsequent pair of canonical variates maximizes the same correlation subject to being uncorrelated with all previous pairs. The number of canonical correlations equals \min(p, q).
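The maximization described above can be sketched with NumPy. This is a minimal illustration, not the code behind CANCORR: the helper name canonical_correlations is hypothetical, and it assumes both variable sets have full column rank. It centers each set, builds an orthonormal basis for each with a QR decomposition, and reads the canonical correlations off the singular values of the product of the two bases.

```python
import numpy as np

def canonical_correlations(X, Y):
    """Canonical correlations between the columns of X and Y (rows are observations).

    Illustrative helper; assumes both matrices have full column rank.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    # Center each column so inner products behave like covariances
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Orthonormal bases for the two centered column spaces
    Qx, _ = np.linalg.qr(Xc)
    Qy, _ = np.linalg.qr(Yc)
    # Singular values of Qx^T Qy are the canonical correlations, largest first
    s = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    # Guard against round-off pushing values slightly outside [0, 1]
    return np.clip(s, 0.0, 1.0)

# With one variable per set, the single canonical correlation is |Pearson r|
r = canonical_correlations([[1], [2], [3], [4]], [[1], [3], [2], [4]])
print(r)  # one value, approximately 0.8
```

The number of values returned is the smaller of the two column counts, matching the count of canonical correlations stated above.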
This implementation relies on the statsmodels CanCorr class, which computes the canonical correlations by singular value decomposition (SVD); this is equivalent to solving the eigenvalue problem on the cross-covariance structure of the variables. The function returns:
- Canonical correlations: Values ranging from 0 to 1 indicating the strength of each canonical relationship
- Eigenvalues: Computed from canonical correlations as \lambda = r^2 / (1 - r^2)
- Wilks’ lambda: A multivariate test statistic for the null hypothesis of no correlation
- Chi-square statistics: Bartlett’s approximation for hypothesis testing:
\chi^2 = -\left(n - 1 - \frac{p + q + 1}{2}\right) \ln(\Lambda)
where n is the number of observations, p and q are the number of variables in each set, and \Lambda is Wilks’ lambda.
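The derived statistics listed above can be reproduced from the canonical correlations alone. The sketch below is an illustration under stated assumptions: cca_summary_stats is a hypothetical helper, the input correlations 0.8 and 0.5 are made-up values, and Wilks' lambda is taken in its standard sequential product form \Lambda_k = \prod_{i \ge k} (1 - r_i^2), which the text above does not spell out.

```python
import math

def cca_summary_stats(correlations, n_obs, p, q):
    """Return (eigenvalue, Wilks' lambda, chi-square) for each canonical correlation."""
    rows = []
    for k, r in enumerate(correlations):
        # lambda = r^2 / (1 - r^2); infinite when the correlation is exactly 1
        eigenvalue = r * r / (1.0 - r * r) if r < 1.0 else math.inf
        # Wilks' lambda for correlations k, k+1, ...: product of (1 - r_i^2)
        wilks = math.prod(1.0 - ri * ri for ri in correlations[k:])
        # Bartlett's approximation: -(n - 1 - (p + q + 1) / 2) * ln(Lambda)
        if wilks > 0:
            chi_sq = -(n_obs - 1 - (p + q + 1) / 2) * math.log(wilks)
        else:
            chi_sq = math.inf
        rows.append((eigenvalue, wilks, chi_sq))
    return rows

# Made-up correlations, 20 observations, two variables in each set
for eigenvalue, wilks, chi_sq in cca_summary_stats([0.8, 0.5], n_obs=20, p=2, q=2):
    print(eigenvalue, wilks, chi_sq)
```

A Wilks' lambda close to 0 corresponds to a large chi-square value and therefore to stronger evidence against the null hypothesis of no correlation.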
CCA is widely used in psychology, ecology, marketing research, and bioinformatics to explore relationships between measurement batteries, such as comparing personality inventories or linking gene expression data to phenotypic outcomes. For additional background, see the Wikipedia article on Canonical Correlation and the statsmodels multivariate documentation.
This example function is provided as-is without any representation of accuracy.
Excel Usage
=CANCORR(x_vars, y_vars, standardize)
- x_vars (list[list], required): First set of variables where rows are observations and columns are variables.
- y_vars (list[list], required): Second set of variables where rows are observations and columns are variables.
- standardize (bool, optional, default: true): Whether to standardize variables (mean=0, std=1) before analysis.
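Returns (list[list]): 2D list containing a header row, one row per canonical variate (correlation, eigenvalue, Wilks' lambda, chi-square, degrees of freedom, p-value), and the X and Y canonical coefficients; or an error message string.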
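The standardize option rescales each column to mean 0 and standard deviation 1. A minimal sketch of that step follows, assuming the sample standard deviation (n - 1 denominator), which the description does not specify. The helper standardize_columns is hypothetical and not part of CANCORR.

```python
import statistics

def standardize_columns(rows):
    """Z-score each column of a 2D list (rows are observations)."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    # Sample standard deviation; a constant column gives 0 and cannot be scaled
    stds = [statistics.stdev(col) for col in columns]
    if any(s == 0 for s in stds):
        raise ValueError("cannot standardize a constant column")
    return [
        [(value - m) / s for value, m, s in zip(row, means, stds)]
        for row in rows
    ]

# x_vars from Example 1
z = standardize_columns([[1, 2.5], [2, 3.2], [3, 4.1], [4, 5.3], [5, 6]])
```

Canonical correlations are unchanged by this rescaling; it affects only the scale of the canonical coefficients.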
Examples
Example 1: Demo case 1
Inputs:
| x_vars | | y_vars | |
|---|---|---|---|
| 1 | 2.5 | 2.1 | 1.5 |
| 2 | 3.2 | 3 | 2.8 |
| 3 | 4.1 | 4.2 | 3.1 |
| 4 | 5.3 | 5.1 | 4.5 |
| 5 | 6 | 6 | 5.2 |
Excel formula:
=CANCORR({1,2.5;2,3.2;3,4.1;4,5.3;5,6}, {2.1,1.5;3,2.8;4.2,3.1;5.1,4.5;6,5.2})
Expected output:
"non-error"
Example 2: Demo case 2
Inputs:
| x_vars | | | y_vars | | | standardize |
|---|---|---|---|---|---|---|
| 1.2 | 2.8 | 1.9 | 3.4 | 2.1 | 1.6 | true |
| 2.3 | 3.5 | 2.4 | 4.2 | 3.3 | 2.5 | |
| 3.1 | 4.2 | 3.7 | 5.3 | 4.5 | 3.2 | |
| 4.5 | 5.1 | 4.2 | 6.2 | 5.2 | 4.7 | |
| 5.3 | 6.4 | 5.6 | 7.1 | 6.5 | 5.3 | |
| 6.7 | 7.3 | 6.1 | 8.3 | 7.2 | 6.8 | |
| 7.2 | 8.1 | 7.4 | 9.1 | 8.4 | 7.2 | |
Excel formula:
=CANCORR({1.2,2.8,1.9;2.3,3.5,2.4;3.1,4.2,3.7;4.5,5.1,4.2;5.3,6.4,5.6;6.7,7.3,6.1;7.2,8.1,7.4}, {3.4,2.1,1.6;4.2,3.3,2.5;5.3,4.5,3.2;6.2,5.2,4.7;7.1,6.5,5.3;8.3,7.2,6.8;9.1,8.4,7.2}, TRUE)
Expected output:
"non-error"
Example 3: Demo case 3
Inputs:
| x_vars | | y_vars | |
|---|---|---|---|
| 1 | 1.5 | 1.8 | 2.1 |
| 2 | 2.2 | 2.7 | 3.3 |
| 3 | 3.8 | 3.5 | 4.2 |
| 4 | 4.5 | 4.6 | 5.1 |
| 5 | 5.7 | 5.4 | 6.5 |
| 6 | 6.3 | 6.9 | 7.2 |
| 7 | 7.9 | 7.3 | 8.4 |
Excel formula:
=CANCORR({1,1.5;2,2.2;3,3.8;4,4.5;5,5.7;6,6.3;7,7.9}, {1.8,2.1;2.7,3.3;3.5,4.2;4.6,5.1;5.4,6.5;6.9,7.2;7.3,8.4})
Expected output:
"non-error"
Example 4: Demo case 4
Inputs:
| x_vars | | y_vars | | standardize |
|---|---|---|---|---|
| 1 | 2.5 | 2.1 | 1.5 | false |
| 2 | 3.2 | 3 | 2.8 | |
| 3 | 4.1 | 4.2 | 3.1 | |
| 4 | 5.3 | 5.1 | 4.5 | |
| 5 | 6 | 6 | 5.2 | |
Excel formula:
=CANCORR({1,2.5;2,3.2;3,4.1;4,5.3;5,6}, {2.1,1.5;3,2.8;4.2,3.1;5.1,4.5;6,5.2}, FALSE)
Expected output:
"non-error"
Python Code
import math

import numpy as np
from statsmodels.multivariate.cancorr import CanCorr as statsmodels_cancorr


def cancorr(x_vars, y_vars, standardize=True):
    """
    Performs Canonical Correlation Analysis (CCA) between two sets of variables.

    See: https://www.statsmodels.org/stable/generated/statsmodels.multivariate.cancorr.CanCorr.html

    This example function is provided as-is without any representation of accuracy.

    Args:
        x_vars (list[list]): First set of variables where rows are observations and columns are variables.
        y_vars (list[list]): Second set of variables where rows are observations and columns are variables.
        standardize (bool, optional): Whether to standardize variables (mean=0, std=1) before analysis. Default is True.

    Returns:
        list[list]: 2D list with canonical correlations, or error message string.
    """
    def to2d(x):
        # Wrap a scalar so single-cell inputs are handled as 1x1 tables
        return [[x]] if not isinstance(x, list) else x

    def validate_2d_array(arr, name):
        # Validate that arr is a 2D list of numeric values
        if not isinstance(arr, list):
            return f"Invalid input: {name} must be a 2D list."
        if len(arr) == 0:
            return f"Invalid input: {name} must not be empty."
        for i, row in enumerate(arr):
            if not isinstance(row, list):
                return f"Invalid input: {name} must be a 2D list."
            if len(row) == 0:
                return f"Invalid input: {name} rows must not be empty."
            for j, val in enumerate(row):
                if not isinstance(val, (int, float, bool)):
                    return f"Invalid input: {name}[{i}][{j}] must be numeric."
                num_val = float(val)
                if math.isnan(num_val) or math.isinf(num_val):
                    return f"Invalid input: {name}[{i}][{j}] must be finite."
        # Check that all rows have the same length
        row_lengths = [len(row) for row in arr]
        if len(set(row_lengths)) > 1:
            return f"Invalid input: {name} must have consistent row lengths."
        return None

    # Normalize inputs
    x_vars = to2d(x_vars)
    y_vars = to2d(y_vars)

    # Validate inputs
    error = validate_2d_array(x_vars, "x_vars")
    if error:
        return error
    error = validate_2d_array(y_vars, "y_vars")
    if error:
        return error

    # Validate standardize
    if not isinstance(standardize, bool):
        return "Invalid input: standardize must be a boolean."

    # Check that x_vars and y_vars have the same number of rows
    if len(x_vars) != len(y_vars):
        return "Invalid input: x_vars and y_vars must have the same number of observations (rows)."

    # Check minimum number of observations
    n_obs = len(x_vars)
    n_x_vars = len(x_vars[0])
    n_y_vars = len(y_vars[0])
    if n_obs < max(n_x_vars, n_y_vars) + 1:
        return "Invalid input: number of observations must be greater than the number of variables."

    try:
        x_arr = np.asarray(x_vars, dtype=float)
        y_arr = np.asarray(y_vars, dtype=float)

        # CanCorr has no standardize option, so scale the columns here
        # (sample standard deviation, ddof=1)
        if standardize:
            x_sd = x_arr.std(axis=0, ddof=1)
            y_sd = y_arr.std(axis=0, ddof=1)
            if (x_sd == 0).any() or (y_sd == 0).any():
                return "Invalid input: variables must not be constant."
            x_arr = (x_arr - x_arr.mean(axis=0)) / x_sd
            y_arr = (y_arr - y_arr.mean(axis=0)) / y_sd

        # Perform canonical correlation analysis.
        # The signature is CanCorr(endog, exog) and x_cancoef belongs to exog,
        # so x_vars is passed second to keep the coefficient labels aligned.
        cca = statsmodels_cancorr(y_arr, x_arr)

        # Get test results
        corr_test = cca.corr_test()

        # Build output table
        output = []

        # Header row
        output.append([
            'canonical_variate',
            'correlation',
            'eigenvalue',
            'wilks_lambda',
            'chi_square',
            'df',
            'p_value'
        ])

        # Results for each canonical correlation
        n_cv = len(cca.cancorr)
        for i in range(n_cv):
            # Calculate eigenvalue from canonical correlation
            r = float(cca.cancorr[i])
            eigenval = (r * r) / (1.0 - r * r) if r < 1.0 else float('inf')

            # Get Wilks' lambda from test results
            wilks = float(corr_test.stats.loc[i, "Wilks' lambda"])

            # Calculate chi-square using Bartlett's approximation
            chi_sq = -(n_obs - 1.0 - (n_x_vars + n_y_vars + 1.0) / 2.0) * math.log(wilks) if wilks > 0 else float('inf')

            # Get degrees of freedom and p-value
            # (the p-value comes from the F approximation reported by statsmodels)
            df = float(corr_test.stats.loc[i, 'Num DF'])
            pval = float(corr_test.stats.loc[i, 'Pr > F'])

            row = [
                i + 1,     # canonical variate number
                r,         # canonical correlation
                eigenval,  # eigenvalue
                wilks,     # Wilks' lambda
                chi_sq,    # chi-square
                df,        # degrees of freedom
                pval       # p-value
            ]
            output.append(row)

        # Add blank row separator
        output.append([''] * 7)

        # Add X coefficients section
        output.append(['X Coefficients'] + [''] * 6)
        x_coef_header = ['Variable'] + [f'CV{j+1}' for j in range(n_cv)] + [''] * (7 - n_cv - 1)
        output.append(x_coef_header[:7])
        for i in range(n_x_vars):
            row = [f'X{i+1}'] + [float(cca.x_cancoef[i, j]) for j in range(n_cv)] + [''] * (7 - n_cv - 1)
            output.append(row[:7])

        # Add blank row separator
        output.append([''] * 7)

        # Add Y coefficients section
        output.append(['Y Coefficients'] + [''] * 6)
        y_coef_header = ['Variable'] + [f'CV{j+1}' for j in range(n_cv)] + [''] * (7 - n_cv - 1)
        output.append(y_coef_header[:7])
        for i in range(n_y_vars):
            row = [f'Y{i+1}'] + [float(cca.y_cancoef[i, j]) for j in range(n_cv)] + [''] * (7 - n_cv - 1)
            output.append(row[:7])

        return output
    except Exception as e:
        # Includes the ValueError statsmodels raises for collinear inputs
        return f"Calculation error: {e}"